Maintaining a healthy and consistent diet accounts for our most basic and important needs. Access to food sources, however, may be limited for many people, leading to issues of food insecurity. In 2010, 14.5% (17.2 million households) in the U.S. were food insecure, 5.4% of which had very low food security (Coleman-Jensen et al. (2011)).
Many are left to ask: “Where’s my next meal going to come from?”
Goals
Accessibility to food can play an important role in determining whether households have a good and consistent supply of nutritious food. Food accessibility is associated with a person’s proximity to grocery stores, food markets, and other food sources. In this project, I analyze the level of food accessibility for households at the county level and explore how it is linked to various food environment factors, such as income, population demographics, the availability of food assistance programs, and food insecurity. Additionally, I look at the changes in food accessibility from 2010 to 2015, and investigate whether changes in other food environment factors have an impact on food access.
Data
For this project, I use the Food Environment Atlas for 9/10/2020. The dataset has been compiled by the US Department of Agriculture’s (USDA) Economic Research Service (ERS) to study factors that affect food choices and the accessibility of healthy foods in communities. The information in the dataset had been aggregated from various reports from the USDA, the Bureau of the Census, the U.S. Department of Commerce, and more for the years 2006 to 2019. Food accessibility population data, which is the focus of the project, were given for 2010 and 2015.
The dataset contains county-level information on food environment factors such as access to grocery stores/supermarkets/restaurants, local food sales, food prices, food assistance programs like SNAP (Supplemental Nutrition Assistance Program), National School Lunch Program, etc., socioeconomic characteristics, and some health/physical activities. State-level information on household food insecurity is also given. The dataset contained data on 3143 counties, each uniquely identified with their FIPS code.
To enable geographic visualizations and analysis, an additional file from the ArcGIS Hub containing geographic boundary shapes is used.
Preprocessing
The information in the dataset was stored in separate worksheets, grouped according to the category of the features. Since there was a large number of features, the first step was to perform feature reduction. This was done by manually selecting the features that were related to the project’s questions and goals. Features that contained very granular information were also omitted in order to simplify analysis as well as to obtain more general insights.
The dataset and boundary shape file were also compared to check if county FIPS codes and names matched for all observations. Some county names and FIPS codes had changed over the years, which caused discrepancies between the two datasets. These counties were identified and modified to reflect the changes and to ensure observations matched. This brought the number of observations from 3143 to 3142 counties.
For columns with a low number of missing values, imputation was done using the median of those features. This was done while looking at variable distributions to ensure that imputation did not significantly change their distributions. However, median imputation was not appropriate for some 2015 variables that had a larger number of missing values. For these variables, missing values were replaced with numbers from the corresponding feature for 2010.
Overall, the data preprocessing involved merging data, selecting relevant variables and imputing missing values while exploring the nature of the variables. The final cleaned dataset had 3142 rows and 100 columns. A few rows of the final dataset are shown below.
Code
import pandas as pddf = pd.read_pickle(r'./../data/AtlasCleaned.pkl')df.head()
FIPS
State
County
LACCESS_POP10
LACCESS_POP15
PCH_LACCESS_POP_10_15
PCT_LACCESS_POP10
PCT_LACCESS_POP15
LACCESS_LOWI10
LACCESS_LOWI15
...
PCT_HISP10
PCT_NHASIAN10
PCT_NHNA10
PCT_NHPI10
PCT_65OLDER10
PCT_18YOUNGER10
MEDHHINC15
POVRATE15
CHILDPOVRATE15
METRO13
0
01001
AL
Autauga
18428.439685
17496.693038
-5.056026
33.769657
32.062255
5344.427472
6543.676824
...
2.400542
0.855766
0.397647
0.040314
11.995382
26.777959
56580.0
12.7
18.8
1
1
01003
AL
Baldwin
35210.814078
30561.264430
-13.204891
19.318473
16.767489
9952.144027
9886.831137
...
4.384824
0.735193
0.628755
0.043343
16.771185
22.987408
52387.0
12.9
19.6
1
2
01005
AL
Barbour
5722.305602
6069.523628
6.067799
20.840972
22.105560
3135.676086
2948.790251
...
5.051535
0.389700
0.218524
0.087409
14.236807
21.906982
31433.0
32.0
45.2
0
3
01007
AL
Bibb
1044.867327
969.378841
-7.224696
4.559753
4.230324
491.449066
596.162829
...
1.771765
0.096007
0.279293
0.030548
12.681650
22.696923
40767.0
22.2
29.3
1
4
01009
AL
Blount
1548.175559
3724.428242
140.568857
2.700840
6.497380
609.027708
1650.959482
...
8.070200
0.200621
0.497191
0.031402
14.722096
24.608353
50487.0
14.7
22.2
1
5 rows × 100 columns
Analysis
Getting Started
To investigate the relationship between food accessibility and food environment factors, I utilized visualizations and regression models as the primary analysis methods. The analysis requires the following Python library packages:
import warningswarnings.filterwarnings("ignore")import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsimport plotly.express as pximport plotly.io as pioimport json# regressionfrom sklearn.preprocessing import StandardScalerfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import mean_squared_errorfrom sklearn.tree import DecisionTreeRegressorimport statsmodels.api as smfrom sklearn.linear_model import LinearRegression
Visualizations
Food Accessibility
For this project, low accessibility populations are defined to be those “living more than 1 mile from a supermarket or large grocery store if in an urban area, or more than 10 miles from a supermarket or large grocery store if in a rural area” (Economic Research Service (n.d.)). The dataset provides information on the percentage of the county population with low food accessibility. Figure 1 shows the distribution of this variable.
The variable here represents the percentage of the county population that has limited access to food due to living far from food stores. For instance, a value of 40% for a county would indicate that 40% of its population has poor access to food due to living far from stores.
Now, Figure 1 shows the distribution of low food access population percentages across different counties. Each bar represents the number of counties that have a certain range of low access population percentage. For instance, the first bar on the left indicates that there are around 350 counties where 0-5% of their population has limited access to food.
Moving from left to right on the horizontal axis, the level of food accessibility decreases, with the rightmost bar representing counties where almost 100% of their residents have poor access to food i.e. live far from food stores.
Figure 1: Distribution of County % Population with Low Food Accessibility
The distribution plot reveals an intriguing characteristic of food accessibility in the U.S. Specifically, it illustrates that there are two distinct groups of counties based on the extent of food accessibility: those with low accessibility and those that are worse off with extremely low accessibility. The left peak in the graph represents the first group of counties, for which less than 40% of the populations have limited access to food. Majority of the counties fall under this group. However, the second peak reveals that there are also counties facing worse food accessibility issues. For approximately 130 counties, the entire population (up to 100%) has limited access to food sources.
Geographic Visualization
To gain a better spatial understanding of the location of low-accessibility counties in the U.S., I plotted the variable on a chloropleth map.
The map displays the percentage of the population with limited food access on a map. The color scale ranges from 0 to 100%, with the darkest shade representing 0% and the brightest shade representing 100%. Therefore, the bright yellow counties on the map indicate areas where almost 100% of the population has limited access to food due to living far from stores.
Yellow counties have bad food access and the blue ones have better access.
Figure 2: Low Food Accessibility Population Percentages in 2010
Figure 2 shows that low food accessibility counties are concentrated in the Midwest, Southwest, and West regions of the country. This could be due to the remote and isolated geography of these areas, resulting in a scarcity of food stores. Conversely, urban and densely populated regions, such as the Eastern and coastal areas, tend to have better food access. Hence, geographical location appears to play a role in determining the level of food accessibility in a county, as it is defined based on the proximity of households to food stores.
Change in Food Accessibility 2010-2015
Next, I investigate how the population level of low food accessibility changed in counties from 2010 to 2015. To analyze this, I created a variable that accounts for the difference by subtracting the 2010 values from the 2015 values of the percentage of the county population with low food accessibility. This variable is then visualized using a county-level chloropleth map.
Change = 2015 value - 2010 value of Low Food Access Population %
Red Counties –> Food access gets worse in 2015 (Change > 0 because 2015 number was greater; a greater number means that more people did not have good food accessiblity)
Figure 3: Change in Low Food Access Pop. % from 2010 to 2015
The chloropeth map shows that while there was no significant change in most counties, there were some counties that experienced remarkable improvements or deteriorations in food accessibility. These counties are denoted by the blue and red colors, respectively. Interestingly, these counties are concentrated in the Midwest and West regions of the country, similar to the location of low-access counties shown in the previous map.
Food and Vehicle Access
The dataset measures food accessibility based on the proximity of residential areas to food stores. However, it is important to note that the availability of vehicles can also play a significant role in determining food accessibility. While some populations may ive far from stores, having a car or access to transportation can help residents travel greater distances to purchase food.
To take account of this additional variable, we look at the density plot of household vehicle availability and county food accessibility. The blocks farther away from the axes represent the areas where both food accessibility and vehicle availability for the populations living there are low. These counties are likely to have the biggest challenges of food accessibility.
Code
def plt_dis(c): f = sns.displot(data=df, x=c, y='PCT_LACCESS_POP10', height=4, aspect=10/8.27, bins=10) plt.xlim([-5, 100]) plt.ylabel("Low Food Access Pop. %", alpha=0.9) plt.xlabel("Household % with Low Food Access & No Vehicle", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_HHNV10')
Figure 4: Low Food Access Population and Household Vehicle Availability
In other counties, although food accessibility is bad, most households have access to vehicles (shown by the few dark areas). For most counties, 0-40% of residents have low food accessibility, but only less than 10% of households with low food access do not have vehicles. In other words, 90% of residents with low food accessibility do have access to a car.
Additionally, for some of the counties with bad food access, only a small proportion of households did not have access to cars (shown by top left boxes). This could mean that the problem of food accessibility in these counties may be lower than depicted earlier. Food accessibility may be the most problematic in only a few counties (shown by the boxes farther away from the axes).
Income and Food Access
Income is another crucial factor that can impact access to food. Low-income groups may be more sensitive to food prices and the lack of food availability in close proximity to their homes. The scatter plot charts the population with low food accessibility and populations with both low income and low food access. Here, low income is defined as “annual family income of less than or equal to 200 percent of the Federal poverty threshold based on family size” (Economic Research Service (n.d.)).
Code
sns.set(color_codes=True)f, ax = plt.subplots(figsize=(6, 5))sns.scatterplot(data=df, x='PCT_LACCESS_POP10', y='PCT_LACCESS_LOWI10')ax.set_ylabel("County Population % with Low Access and Low Income", size =12, alpha=0.9)ax.set_xlabel("County Population % with Low Food Access",size =12, alpha=0.9)ax.tick_params(axis='both', which='major', labelsize=14)plt.xlim([0, 102])xpoints = ypoints = ax.get_xlim()ax.plot(xpoints, ypoints, linestyle='--', color='cornflowerblue', lw=1, scalex=False, scaley=False)f.show()
Figure 5: Low Food Access Population and Household Vehicle Availability
The plot shows that the data points always fall below the identity function line. So, the proportion of population with both low food access and low income is less than the population percentage with low food access. This suggests that not all of the population with low food access have low income (as defined for the datset). However, the plot also shows that counties with worse food access have greater percentages of people with low access and low income. Thus, income can be correlated with food accessibility. Furthermore, the variation in these percentage values increases as accessibility worsens. This is also reminiscent of the variation in household availability for counties with bad food access.
Demographic Characteristics
To explore the racial composition of counties with very low food accessibility, the dataset was filtered to only include counties where more than 80% of the population had low food accessibility. The average population percentages of different race groups were then calculated for these counties.
Table 1 shows that in counties where more than 80% of the population has low food accessibility, the majority of the population is White, followed by Hispanic and American Indian / Alaska Native.
Code
from IPython.display import Markdownfrom tabulate import tabulate# Create a separate dataframe containing county info and race categoriesdf_race = pd.DataFrame(df, columns=['FIPS', 'State', 'County', 'PCT_LACCESS_POP10','PCT_NHWHITE10','PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10','PCT_NHNA10','PCT_NHPI10', 'PCT_LACCESS_POP15', 'PCT_LACCESS_WHITE15','PCT_LACCESS_BLACK15', 'PCT_LACCESS_HISP15','PCT_LACCESS_NHASIAN15', 'PCT_LACCESS_NHNA15', 'PCT_LACCESS_NHPI15','PCT_LACCESS_MULTIR15'])# Store avg pct values for each race here, create table using this dictionarytable = []# all race categories as given by the datasetrace_catg = ["WHITE", "BLACK", "HISPANIC", "ASIAN", "AMER INDIAN / ALASKA NATIVE", "HAWAIIAN / PACIFIC ISLANDER", "MULTIRACIAL"]# 2010 very low access counties low_access10 = df_race[df['PCT_LACCESS_POP10']>80]# Update dictionary with 2010 mean values for very low access countiesfor (mean_pct,catg) inzip(low_access10.mean()[1:7],race_catg): table.append([catg, round(mean_pct, 2)])# 2015 low_access15 = df_race[df['PCT_LACCESS_POP15']>80]for (mean_pct,i) inzip(low_access15.mean()[8:],range(len(race_catg))): mean_pct =round(mean_pct, 2)try: table[i].append(mean_pct)exceptIndexError: table[-1].append(0) table[-1].append(mean_pct) # accounting for the extra MULTIRACIAL catg for 2015Markdown(tabulate(table, headers=["Race","2010", "2015"]))
Table 1: Racial Composition in Low Food Access Counties
Race
2010
2015
WHITE
84.93
85.05
BLACK
0.6
0.91
HISPANIC
9.26
9.92
ASIAN
0.3
0.34
AMER INDIAN / ALASKA NATIVE
3.63
5.61
HAWAIIAN / PACIFIC ISLANDER
0.03
0.06
Features’ Relationships
Next, we explore the relationship between food accessibility and two other variables in the dataset: food insecurity and the presence of food assistance programs. First, we examine if food accessibility variables are related to food insecurity. Second, we investigate if there is a relationship between the presence of food assistance programs and food accessibility levels.
Food Insecurity and Accessibility
Food-insecure households are those that “were unable, at times during the year, to provide adequate food for one or more household members because the household lacked money and other resources for food” (Economic Research Service (n.d.)). The values measuring food insecurity were given as three-year averages. In the following model, the food insecurity percent average for 2012-2014 is used as the response variable for the training set whereas the food insecurity percent average for 2015-2017 is used as the response variable for the test set. However, the dataset provides only state-level information for this feature. Therefore, the data frame used here was created by grouping the rows by state and taking the average. The final input data frame contains 51 rows with state-level information.
Response Variable is: State Food Insecurity percentage (three year average) Input Variables are the average by state of:
Low Food Access Population %
County Population % with Low Food Access and Low Income
Household % with Low Food Access and no Car
Training and test dataset are 2010 and 2015 data respectively
# create separate dataframes for 2010 and 2015df10 = pd.DataFrame(df_avg, columns = df10_cols)df15 = pd.DataFrame(df_avg, columns = df15_cols)# Standardizing dataframescaler = StandardScaler()df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)# Rename columns in both dataframes so all column names are commondf10.rename(columns =lambda x : str(x)[:-2], inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)# 2010 data as training dataX_train = df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER', 'VLFOODSEC_12_', 'FOODINSEC_12_'], axis=1)y_train = df10['FOODINSEC_12_']# 2015 as test dataX_test = df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR', 'VLFOODSEC_15_', 'FOODINSEC_15_'], axis=1)y_test = df15['FOODINSEC_15_']X_train = X_train.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP','PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)X_test = X_test.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP','PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)# Linear regression modelX = sm.add_constant(X_train)model1 = sm.OLS(y_train, X).fit()model1.summary()# getting test scoremodel2 = LinearRegression()model2.fit(X_train,y_train)# OUtput print('R-squared:', model1.rsquared)print('Model Test R-squared:', model2.score(X_test,y_test))print()print("The p-values for the linear regression model are as follows:")print('P-values:', model1.pvalues)
R-squared: 0.10953682615612048
Model Test R-squared: 0.018688402604447707
The p-values for the linear regression model are as follows:
P-values: const 1.000000
PCT_LACCESS_POP 0.025167
PCT_LACCESS_LOWI 0.023915
PCT_LACCESS_HHNV 0.592164
dtype: float64
The linear regression returns a low \(r^2\) score. Although this may be partly attributed to the small training set and the availability of only state-level food insecurity data, the extremely low \(r^2\) suggests that food accessibility is not strongly correlated with food insecurity. This means that households living far from food stores may still have good food security, and distance from food stores does not entirely dictate the level of food insecurity. Looking at the p-values, the food access population and income variables have the lowest p-values suggesting that they are the more significant variables in determining food insecurity.
Furthermore, the regression model used to predict state food insecurity for 2015 returns a lower \(r^2\), which suggests that changes in food accessibility do not fully account for changes in food insecurity. Other factors not present in the model and the dataset may be needed for determining food insecurity levels in a state.
Food Assistance Programs
The availability of food assistance programs in the county can help with improving food accessibility in the region. We use a regression model to study the relationship between food assistance accessibility and the numerous variables related to food assistance programs provided in the dataset. Our goal is to determine if these programs are available in areas with low food accessibility where the need for such programs may be the greatest.
Regression models are created separately for 2010 and 2015 to see how the availability of food assistance programs and food accessibility are related for both years. Following the previous analysis, the same state-level data frame is used for the regression models.
The response variable is the state average Low Food Access Population Percentage.
The input variables here are the state-level averages of:
Students eligible for free lunch (%)
Students eligible for reduced-price lunch (%)
SNAP participants (% pop)
SNAP benefits per capita (% change)
National School Lunch Program participants (%)
School Breakfast Program participants (% children)
Summer Food Service Program participants (% children)
FDPIR (Food Distribution Program on Indian Reservations) sites
For both years, the \(r^2\) scores are similar and less than 0.5. This suggests that the prevalence of food assistance programs is only partly correlated with food accessibility. To check if the low \(r^2\) is due to variation within states, we train and apply the regression model at the county level.
The county-level model performs worse than the state-level model, as shown by the lower \(r^2\) scores. This suggests that food assistance programs may not be available in all counties where food access is difficult. However, the better performing state-level model suggests that people can travel within the state to a different county to access these food assistance programs. Therefore, food asssistance programs in the country may still be effective if traveling within state is not an issue.
The previous analysis showed that food insecurity is not entirely limited by accessibility. If this is the case and if food assitance programs are available in nearby counties, low food accessibility populations may still be able to maintain better food security.
Change in Food Accessibility (2010-15)
Lastly, I use a regression model to analyze the change in food accessibility across U.S. counties. The response variable is the difference between the food accessibility population percentage for 2010 and 2015, calculated by subtracting the 2010 values from the 2015 values. This is the same variable that was used earlier to generate the geographic visualization of the change in accessibility.
The predictor variables used here are listed below along with their descriptions. The variables encompass information such as the number of food stores, farm sales, expernditure on other food sources like restaurants, food insecurity levels, and access to food assistance programs.
Code
import pandas as pdtable = {'PCT_REDUCED_LUNCH': "Students eligible for reduced-price lunch (%)",'GROCPTH': "Grocery stores/1,000 pop",'SUPERCPTH':"Supercenters & club stores/1,000 pop",'CONVSPTH':"Convenience stores/1,000 pop",'SPECSPTH':"Specialized food stores/1,000 pop",'SNAPSPTH':"SNAP-authorized stores/1,000 pop",'FSRPTH':"Fast-food restaurants/1,000 pop",'PC_FFRSALES':"Expenditures per capita, fast food",'PC_FSRSALES':"Expenditures per capita, restaurants",'PCT_SNAP':"SNAP participants (% pop)",'SNAP_PART_RATE':"SNAP participants (% eligible pop)",'PCT_NSLP':"National School Lunch Program participants (% children)",'PCT_SBP':"School Breakfast Program participants (% children)",'PCT_SFSP':"Summer Food Service Program participants (% children)",'FDPIR':"FDPIR Sites",'FOODINSEC':"State Household food insecurity (%, three-year average",'VLFOODSEC':"State Household very low food insecurity (%, three-year average","DIRSALES_FARMS":"Farms wth direct sales",'FMRKTPTH':"Farmers' markets/1,000 pop"}var = pd.DataFrame.from_dict(table, orient ='index')var
0
PCT_REDUCED_LUNCH
Students eligible for reduced-price lunch (%)
GROCPTH
Grocery stores/1,000 pop
SUPERCPTH
Supercenters & club stores/1,000 pop
CONVSPTH
Convenience stores/1,000 pop
SPECSPTH
Specialized food stores/1,000 pop
SNAPSPTH
SNAP-authorized stores/1,000 pop
FSRPTH
Fast-food restaurants/1,000 pop
PC_FFRSALES
Expenditures per capita, fast food
PC_FSRSALES
Expenditures per capita, restaurants
PCT_SNAP
SNAP participants (% pop)
SNAP_PART_RATE
SNAP participants (% eligible pop)
PCT_NSLP
National School Lunch Program participants (% ...
PCT_SBP
School Breakfast Program participants (% child...
PCT_SFSP
Summer Food Service Program participants (% ch...
FDPIR
FDPIR Sites
FOODINSEC
State Household food insecurity (%, three-year...
VLFOODSEC
State Household very low food insecurity (%, t...
DIRSALES_FARMS
Farms wth direct sales
FMRKTPTH
Farmers' markets/1,000 pop
All variables are standardized before training and applying the regression model.
Note
The income and household vehicle availability variables are removed here since they are largely correlated with the food accessibility response variables.
Code
# create 2010 and 2015 dataframes for all countiesdf10 = pd.DataFrame(df, columns = df10_cols)df15 = pd.DataFrame(df, columns = df15_cols)# Rename columnsdf10.rename(columns =lambda x : str(x)[:-2], inplace=True)df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)# Drop columnsdf10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)# Create dataframe with differences between 2010 & 2015 values as columnsdiff = pd.DataFrame()for c in df10.columns: diff[c] = df15[c]-df10[c]# Scale all columnsscaler = StandardScaler()diff = pd.DataFrame(scaler.fit_transform(diff), columns=diff.columns)# Drop correlating variables and create train/test dataX = diff.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)y = diff['PCT_LACCESS_POP']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Apply linear regressionX = sm.add_constant(X_train)model = sm.OLS(y_train, X).fit()model.summary()print('R-squared:', model.rsquared)print('P-values:', model.pvalues)
The model used here does not produce good \(r^2\) scores. One reason for this may be that the counties for which there are significant changes are low in number and may seem like unusual values for the dataset. This is shown by the distribution plot shown in Figure 6.
Code
# Plot distribution of difference in low food access pop% def plt_dis(c): f = sns.displot(data=diff, x=c, height=4, aspect=10/8.27, bins=20) plt.ylabel("Number of Counties", alpha=0.8) plt.xlabel("Change in Low Food Access Pop. % (standardized)", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_POP')
Figure 6: Distribution of Change in Low Access Population (Standardized)
Now, we create a model using data for only the counties that have significant changes in food accessibility. For this, the dataset was filtered to contain only counties for which the standardized Change in Food Accessibility was greater than 1.5 and less than -1.5.
We apply both linear regression and decision tree regression models and compare their performance.
Code
# filter data for extreme casesdf_extreme = diff[diff['PCT_LACCESS_POP'] >1.5]df_extreme = diff[diff['PCT_LACCESS_POP'] <-1.5]# Drop correlated variables and create train/test dataX = df_extreme.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)y = df_extreme['PCT_LACCESS_POP']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Linear regressionprint("Linear Regression")X = sm.add_constant(X_train)model = sm.OLS(y_train, X).fit()model.summary()print('R-squared:', model.rsquared)print('P-values:', model.pvalues)print()print()# Decision Treeprint("Decision Tree Regression")dt_model = DecisionTreeRegressor(random_state=42)dt_model.fit(X_train, y_train)dt_pred = dt_model.predict(X_test)dt_score = dt_model.score(X_test, y_test)print("R-aquared score:", dt_score)
The \(r^2\) score for the linear regression model improved significantly when considering only counties for which food accessibility changed notably. When using a decision tree regressor, the \(r^2\) score improved even further. It is possible that these counties with notable changes in food accessibility are also the same counties with extremely low food access, as seen in Figure 3 and Figure 2. This brings us back to the two distinct groups of counties shown in Figure 1; the counties with significant changes in food access may belong to the group of counties where food access tends to be very low. Analyzing the data based on these groups may lead to better results.
The smallest p-value here is for the FDPIR variable which measures the Food Distribution Program on Indian Reservations. This variable may have been significant in the model because the extreme counties filtered here fall in regions where Native American populations are more significant.
Conlusion
This project analyzed food accessibility and other food environment factors. The analysis showed that counties with low food accessibility can be divided into two groups: those with low and very low food accessibility. The counties with very low food accessibility are located in the West and Midwest regions of the country, with populations primarily consisting of White, Hispanic, and Native American/ Alaska Native people. Additionally, the analysis also revealed a correlation between income and food accessibility, as a significant proportion of the population in counties with low food accessibility had low incomes.
Initially, all counties were found to have low food accessibility. However, examining household vehicle availability suggested that the problem may be less severe, as the majority of counties with poor food access had access to cars. There were only a few counties in which low food accessibility coincided with a lack of available vehicles.
Food accessibility remained largely remained the same from 2010 to 2015, with significant changes only occurring in counties with very low food accessibility. The regression model used here tried to look explain the changes by looking at how number of food stores/sources and assistance programs changed. However, these factors were not sufficient build an accurate model which is likely due to the absence of significant variables in the dataset.
Finally, regression models were used to assess the relationship between food insecurity and food accessibility, as well as other food environment factors. The analysis revealed that food insecurity cannot be fully explained by food accessibility alone. Additional factors such as income, poverty rates, and unemployment levels may also be significant. However, these variables could not be included in the current models due to their absence from the dataset or overlap with the response variables. Future analysis could incorporate such variables from other datasets to gain a more comprehensive understanding of food insecurity.
References
Coleman-Jensen, Alisha, Mark Nord, Margaret Andrews, and Steven Carlson. 2011. “Household Food Security in the United States in 2010.”ERR-125, U.S. Dept. Of Agriculture, Econ. Res. Serv.
Economic Research Service, U. S. Department of Agriculture (USDA). n.d. “Food Environment Atlas Documentation.”
Source Code
---title: "Food Accessibility in U.S. Counties"author: "Tenzin Sherpa"bibliography: references.bibnumber-sections: falseformat: html: theme: flatly rendering: default toc: true toc-expand: 4 toc-title: Contents code-fold: true code-tools: true smooth-scroll: true font-family: Georgia, serif; pdf: defaultjupyter: python3---# The IssueMaintaining a healthy and consistent diet accounts for our most basic and important needs. Access to food sources, however, may be limited for many people, leading to issues of food insecurity. In 2010, 14.5% (17.2 million households) in the U.S. were food insecure, 5.4% of which had very low food security (@coleman). <pstyle="text-align: center;">Many are left to ask: *“Where’s my next meal going to come from?”*</p># GoalsAccessibility to food can play an important role in determining whether households have a good and consistent supply of nutritious food. Food accessibility is associated with a person's proximity to grocery stores, food markets, and other food sources. In this project, I analyze the level of food accessibility for households at the county level and explore how it is linked to various food environment factors, such as income, population demographics, the availability of food assistance programs, and food insecurity. Additionally, I look at the changes in food accessibility from 2010 to 2015, and investigate whether changes in other food environment factors have an impact on food access.# DataFor this project, I use the [Food Environment Atlas](https://www.ers.usda.gov/data-products/food-environment-atlas/data-access-and-documentation-downloads/) for 9/10/2020. The dataset has been compiled by the US Department of Agriculture's (USDA) Economic Research Service (ERS) to study factors that affect food choices and the accessibility of healthy foods in communities. The information in the dataset had been aggregated from various reports from the USDA, the Bureau of the Census, the U.S. Department of Commerce, and more for the years 2006 to 2019. Food accessibility population data, which is the focus of the project, were given for 2010 and 2015. The dataset contains county-level information on food environment factors such as access to grocery stores/supermarkets/restaurants, local food sales, food prices, food assistance programs like SNAP (Supplemental Nutrition Assistance Program), National School Lunch Program, etc., socioeconomic characteristics, and some health/physical activities. State-level information on household food insecurity is also given. The dataset contained data on 3143 counties, each uniquely identified with their FIPS code.To enable geographic visualizations and analysis, an additional file from the [ArcGIS Hub](https://hub.arcgis.com/datasets/esri::usa-counties-generalized/explore?location=35.782943%2C-127.797159%2C3.76) containing geographic boundary shapes is used.## PreprocessingThe information in the dataset was stored in separate worksheets, grouped according to the category of the features. Since there was a large number of features, the first step was to perform feature reduction. This was done by manually selecting the features that were related to the project's questions and goals. Features that contained very granular information were also omitted in order to simplify analysis as well as to obtain more general insights.The dataset and boundary shape file were also compared to check if county FIPS codes and names matched for all observations. Some county names and FIPS codes had changed over the years, which caused discrepancies between the two datasets. These counties were identified and modified to reflect the changes and to ensure observations matched. This brought the number of observations from 3143 to 3142 counties. For columns with a low number of missing values, imputation was done using the median of those features. This was done while looking at variable distributions to ensure that imputation did not significantly change their distributions. However, median imputation was not appropriate for some 2015 variables that had a larger number of missing values. For these variables, missing values were replaced with numbers from the corresponding feature for 2010. Overall, the data preprocessing involved merging data, selecting relevant variables and imputing missing values while exploring the nature of the variables. The final cleaned dataset had 3142 rows and 100 columns. A few rows of the final dataset are shown below. ```{python, echo=False}import pandas as pddf = pd.read_pickle(r'./../data/AtlasCleaned.pkl')df.head()```# Analysis**Getting Started**To investigate the relationship between food accessibility and food environment factors, I utilized visualizations and regression models as the primary analysis methods. The analysis requires the following Python library packages:```{python}#| code-fold: falseimport warningswarnings.filterwarnings("ignore")import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsimport plotly.express as pximport plotly.io as pioimport json# regressionfrom sklearn.preprocessing import StandardScalerfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import mean_squared_errorfrom sklearn.tree import DecisionTreeRegressorimport statsmodels.api as smfrom sklearn.linear_model import LinearRegression```## Visualizations### Food Accessibility For this project, low accessibility populations are defined to be those "living more than 1 mile from a supermarket or large grocery store if in an urban area, or more than 10 miles from a supermarket or large grocery store if in a rural area" (@documentation). The dataset provides information on the percentage of the county population with low food accessibility. @fig-accessdist10 shows the distribution of this variable.:::{.callout-tip}## To read @fig-accessdist10The variable here represents the percentage of the county population that has limited access to food due to living far from food stores. For instance, a value of 40% for a county would indicate that 40% of its population has poor access to food due to living far from stores. Now, @fig-accessdist10 shows the distribution of low food access population percentages across different counties. Each bar represents the number of counties that have a certain range of low access population percentage. For instance, the first bar on the left indicates that there are around 350 counties where 0-5% of their population has limited access to food. Moving from left to right on the horizontal axis, the level of food accessibility decreases, with the rightmost bar representing counties where almost 100% of their residents have poor access to food i.e. live far from food stores. :::```{python}#| label: fig-accessdist10#| fig-cap: "Distribution of County % Population with Low Food Accessibility"df = pd.read_pickle(r'./../data/AtlasCleaned.pkl')sns.set(style='darkgrid')def plt_dis(c): f = sns.displot(data=df, x=c, height=4, aspect=10/8.27, bins=20) plt.xlim([0, 100]) plt.ylabel("Number of Counties", alpha=0.8) plt.xlabel("Low Food Access Pop. %", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_POP10')```The distribution plot reveals an intriguing characteristic of food accessibility in the U.S. Specifically, it illustrates that there are two distinct groups of counties based on the extent of food accessibility: those with low accessibility and those that are worse off with extremely low accessibility. The left peak in the graph represents the first group of counties, for which less than 40% of the populations have limited access to food. Majority of the counties fall under this group. However, the second peak reveals that there are also counties facing worse food accessibility issues. For approximately 130 counties, the entire population (up to 100%) has limited access to food sources.**Geographic Visualization**To gain a better spatial understanding of the location of low-accessibility counties in the U.S., I plotted the variable on a chloropleth map.:::{.callout-tip}## To read @fig-map10The map displays the percentage of the population with limited food access on a map. The color scale ranges from 0 to 100%, with the darkest shade representing 0% and the brightest shade representing 100%. Therefore, the bright yellow counties on the map indicate areas where almost 100% of the population has limited access to food due to living far from stores. *Yellow counties have bad food access and the blue ones have better access.* :::```{python}#| label: fig-map10#| fig-cap: "Low Food Accessibility Population Percentages in 2010"df = pd.read_pickle(r'./../data/AtlasCleaned.pkl')withopen(r'./../src/USA_Counties_(Generalized).geojson') as f: tract = json.load(f)def plot_chlpth(df, x, label, header): fig = px.choropleth(df, locations='FIPS', color=x, color_continuous_scale="Viridis", range_color=(0, 100), geojson=tract, featureidkey ="properties.FIPS", scope="usa", hover_data=['State', 'County'], labels=label, title=header) fig.update_layout(legend=dict(orientation="h", yanchor="top",y=0.5,xanchor="right",x=1)) fig.show()plot_chlpth(df, 'PCT_LACCESS_POP10', {'PCT_LACCESS_POP10':'Popn. %'}, None)```@fig-map10 shows that low food accessibility counties are concentrated in the Midwest, Southwest, and West regions of the country. This could be due to the remote and isolated geography of these areas, resulting in a scarcity of food stores. Conversely, urban and densely populated regions, such as the Eastern and coastal areas, tend to have better food access. Hence, geographical location appears to play a role in determining the level of food accessibility in a county, as it is defined based on the proximity of households to food stores.**Change in Food Accessibility 2010-2015**Next, I investigate how the population level of low food accessibility changed in counties from 2010 to 2015. To analyze this, I created a variable that accounts for the difference by subtracting the 2010 values from the 2015 values of the percentage of the county population with low food accessibility. This variable is then visualized using a county-level chloropleth map.:::{.callout-tip}## To read @fig-changeChange = 2015 value - 2010 value of Low Food Access Population % Red Counties --> Food access gets worse in 2015 (Change > 0 because 2015 number was greater; a greater number means that more people did not have good food accessiblity)Blue Counties --> Food access improves in 2015White Counties --> Food access stays same :::```{python}#| label: fig-change#| fig-cap: "Change in Low Food Access Pop. % from 2010 to 2015"df_diff = pd.DataFrame(df, columns = ['FIPS', 'State', 'County', 'PCT_LACCESS_POP10', 'PCT_LACCESS_POP15'])df_diff['diff'] = df['PCT_LACCESS_POP15'] - df['PCT_LACCESS_POP10']fig = px.choropleth(df_diff, locations='FIPS', color='diff', color_continuous_scale="RdBu_r", range_color=(df_diff['diff'].min(), df_diff['diff'].max()), geojson=tract, featureidkey ="properties.FIPS", scope="usa", hover_data=['State', 'County'], labels={'diff':'Change in %'})fig.update_layout(legend=dict(orientation="h", yanchor="top",y=0.5,xanchor="right",x=1))fig.show()```The chloropeth map shows that while there was no significant change in most counties, there were some counties that experienced remarkable improvements or deteriorations in food accessibility. These counties are denoted by the blue and red colors, respectively. Interestingly, these counties are concentrated in the Midwest and West regions of the country, similar to the location of low-access counties shown in the previous map.### Food and Vehicle Access The dataset measures food accessibility based on the proximity of residential areas to food stores. However, it is important to note that the availability of vehicles can also play a significant role in determining food accessibility. While some populations may ive far from stores, having a car or access to transportation can help residents travel greater distances to purchase food. To take account of this additional variable, we look at the density plot of household vehicle availability and county food accessibility. The blocks farther away from the axes represent the areas where both food accessibility and vehicle availability for the populations living there are low. These counties are likely to have the biggest challenges of food accessibility. ```{python}#| label: fig-car10#| fig-cap: "Low Food Access Population and Household Vehicle Availability"def plt_dis(c): f = sns.displot(data=df, x=c, y='PCT_LACCESS_POP10', height=4, aspect=10/8.27, bins=10) plt.xlim([-5, 100]) plt.ylabel("Low Food Access Pop. %", alpha=0.9) plt.xlabel("Household % with Low Food Access & No Vehicle", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_HHNV10')```In other counties, although food accessibility is bad, most households have access to vehicles (shown by the few dark areas). For most counties, 0-40% of residents have low food accessibility, but only less than 10% of households with low food access do not have vehicles. In other words, 90% of residents with low food accessibility do have access to a car. Additionally, for some of the counties with bad food access, only a small proportion of households did not have access to cars (shown by top left boxes). This could mean that the problem of food accessibility in these counties may be lower than depicted earlier. Food accessibility may be the most problematic in only a few counties (shown by the boxes farther away from the axes).### Income and Food AccessIncome is another crucial factor that can impact access to food. Low-income groups may be more sensitive to food prices and the lack of food availability in close proximity to their homes. The scatter plot charts the population with low food accessibility and populations with both low income and low food access. Here, low income is defined as "annual family income of less than or equal to 200 percent of the Federal poverty threshold based on family size" (@documentation).```{python}#| label: fig-income10#| fig-cap: "Low Food Access Population and Household Vehicle Availability"sns.set(color_codes=True)f, ax = plt.subplots(figsize=(6, 5))sns.scatterplot(data=df, x='PCT_LACCESS_POP10', y='PCT_LACCESS_LOWI10')ax.set_ylabel("County Population % with Low Access and Low Income", size =12, alpha=0.9)ax.set_xlabel("County Population % with Low Food Access",size =12, alpha=0.9)ax.tick_params(axis='both', which='major', labelsize=14)plt.xlim([0, 102])xpoints = ypoints = ax.get_xlim()ax.plot(xpoints, ypoints, linestyle='--', color='cornflowerblue', lw=1, scalex=False, scaley=False)f.show()```The plot shows that the data points always fall below the identity function line. So, the proportion of population with both low food access and low income is less than the population percentage with low food access. This suggests that not all of the population with low food access have low income (as defined for the datset). However, the plot also shows that counties with worse food access have greater percentages of people with low access and low income. Thus, income can be correlated with food accessibility. Furthermore, the variation in these percentage values increases as accessibility worsens. This is also reminiscent of the variation in household availability for counties with bad food access.### Demographic CharacteristicsTo explore the racial composition of counties with very low food accessibility, the dataset was filtered to only include counties where more than 80% of the population had low food accessibility. The average population percentages of different race groups were then calculated for these counties.@tbl-race shows that in counties where more than 80% of the population has low food accessibility, the majority of the population is White, followed by Hispanic and American Indian / Alaska Native. ```{python}#| label: tbl-race#| tbl-cap: "Racial Composition in Low Food Access Counties"from IPython.display import Markdownfrom tabulate import tabulate# Create a separate dataframe containing county info and race categoriesdf_race = pd.DataFrame(df, columns=['FIPS', 'State', 'County', 'PCT_LACCESS_POP10','PCT_NHWHITE10','PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10','PCT_NHNA10','PCT_NHPI10', 'PCT_LACCESS_POP15', 'PCT_LACCESS_WHITE15','PCT_LACCESS_BLACK15', 'PCT_LACCESS_HISP15','PCT_LACCESS_NHASIAN15', 'PCT_LACCESS_NHNA15', 'PCT_LACCESS_NHPI15','PCT_LACCESS_MULTIR15'])# Store avg pct values for each race here, create table using this dictionarytable = []# all race categories as given by the datasetrace_catg = ["WHITE", "BLACK", "HISPANIC", "ASIAN", "AMER INDIAN / ALASKA NATIVE", "HAWAIIAN / PACIFIC ISLANDER", "MULTIRACIAL"]# 2010 very low access counties low_access10 = df_race[df['PCT_LACCESS_POP10']>80]# Update dictionary with 2010 mean values for very low access countiesfor (mean_pct,catg) inzip(low_access10.mean()[1:7],race_catg): table.append([catg, round(mean_pct, 2)])# 2015 low_access15 = df_race[df['PCT_LACCESS_POP15']>80]for (mean_pct,i) inzip(low_access15.mean()[8:],range(len(race_catg))): mean_pct =round(mean_pct, 2)try: table[i].append(mean_pct)exceptIndexError: table[-1].append(0) table[-1].append(mean_pct) # accounting for the extra MULTIRACIAL catg for 2015Markdown(tabulate(table, headers=["Race","2010", "2015"]))```## Features' RelationshipsNext, we explore the relationship between food accessibility and two other variables in the dataset: food insecurity and the presence of food assistance programs. First, we examine if food accessibility variables are related to food insecurity. Second, we investigate if there is a relationship between the presence of food assistance programs and food accessibility levels.### Food Insecurity and Accessibility Food-insecure households are those that “were unable, at times during the year, to provide adequate food for one or more household members because the household lacked money and other resources for food” (@documentation). The values measuring food insecurity were given as three-year averages. In the following model, the food insecurity percent average for 2012-2014 is used as the response variable for the training set whereas the food insecurity percent average for 2015-2017 is used as the response variable for the test set. However, the dataset provides only state-level information for this feature. Therefore, the data frame used here was created by grouping the rows by state and taking the average. The final input data frame contains 51 rows with state-level information. Response Variable is: State Food Insecurity percentage (three year average)Input Variables are the average by state of:* Low Food Access Population %* County Population % with Low Food Access and Low Income* Household % with Low Food Access and no Car*Training and test dataset are 2010 and 2015 data respectively*```{python}# Selecting variables df.drop(labels=['LACCESS_POP10', 'LACCESS_POP15', 'PCH_LACCESS_POP_10_15', 'LACCESS_LOWI10', 'LACCESS_LOWI15','PCH_LACCESS_LOWI_10_15', 'LACCESS_HHNV10', 'LACCESS_HHNV15', 'PCH_LACCESS_HHNV_10_15','LACCESS_SNAP15', 'LACCESS_CHILD10', 'LACCESS_CHILD15', 'LACCESS_CHILD_10_15', 'LACCESS_SENIORS10', 'LACCESS_SENIORS15', 'PCH_LACCESS_SENIORS_10_15', 'LACCESS_WHITE15', 'LACCESS_BLACK15', 'LACCESS_HISP15', 'LACCESS_NHASIAN15', 'LACCESS_NHNA15', 'LACCESS_NHPI15', 'LACCESS_MULTIR15'], axis=1, inplace=True)df_avg = df.groupby(['State']).mean()df10_cols = ['PCT_LACCESS_POP10', 'PCT_LACCESS_LOWI10', 'PCT_LACCESS_HHNV10','PCT_LACCESS_CHILD10','PCT_LACCESS_SENIORS10', 'PCT_FREE_LUNCH10', 'PCT_REDUCED_LUNCH10','PCT_NHWHITE10', 'PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10', 'PCT_NHNA10', 'PCT_NHPI10','PCT_65OLDER10', 'PCT_18YOUNGER10','GROCPTH11', 'SUPERCPTH11', 'CONVSPTH11', 'SPECSPTH11', 'SNAPSPTH12', 'FFRPTH11', 'FSRPTH11', 'PC_FFRSALES07','PC_FSRSALES07', 'PCT_SNAP12', 'SNAP_PART_RATE11', 'PCT_NSLP12','PCT_SBP12', 'PCT_SFSP12','FDPIR12','FOODINSEC_12_14','VLFOODSEC_12_14','DIRSALES_FARMS07', 'FMRKTPTH13']df15_cols = ['PCT_LACCESS_POP15', 'PCT_LACCESS_LOWI15', 'PCT_LACCESS_HHNV15', 'PCT_LACCESS_SNAP15', 'PCT_LACCESS_CHILD15', 'PCT_LACCESS_SENIORS15', 'PCT_LACCESS_WHITE15','PCT_LACCESS_BLACK15','PCT_LACCESS_HISP15', 'PCT_LACCESS_NHASIAN15', 'PCT_LACCESS_NHNA15', 'PCT_LACCESS_NHPI15', 'PCT_LACCESS_MULTIR15', 'PCT_FREE_LUNCH15', 'PCT_REDUCED_LUNCH15', 'GROCPTH16', 'SUPERCPTH16', 'CONVSPTH16', 'SPECSPTH16', 'SNAPSPTH17', 'FFRPTH16', 'FSRPTH16','PC_FFRSALES12', 'PC_FSRSALES12', 'PCT_SNAP17','SNAP_PART_RATE16', 'PCT_NSLP17', 'PCT_SBP17', 'PCT_SFSP17','FDPIR15', 'FOODINSEC_15_17','VLFOODSEC_15_17', 'DIRSALES_FARMS12', 'FMRKTPTH18']# Foodhub, metro 13, medhincome 15 'POVRATE15' are not here``````{python}# create separate dataframes for 2010 and 2015df10 = pd.DataFrame(df_avg, columns = df10_cols)df15 = pd.DataFrame(df_avg, columns = df15_cols)# Standardizing dataframescaler = StandardScaler()df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)# Rename columns in both dataframes so all column names are commondf10.rename(columns =lambda x : str(x)[:-2], inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)# 2010 data as training dataX_train = df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER', 'VLFOODSEC_12_', 'FOODINSEC_12_'], axis=1)y_train = df10['FOODINSEC_12_']# 2015 as test dataX_test = df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR', 'VLFOODSEC_15_', 'FOODINSEC_15_'], axis=1)y_test = df15['FOODINSEC_15_']X_train = X_train.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP','PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)X_test = X_test.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP','PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)# Linear regression modelX = sm.add_constant(X_train)model1 = sm.OLS(y_train, X).fit()model1.summary()# getting test scoremodel2 = LinearRegression()model2.fit(X_train,y_train)# OUtput print('R-squared:', model1.rsquared)print('Model Test R-squared:', model2.score(X_test,y_test))print()print("The p-values for the linear regression model are as follows:")print('P-values:', model1.pvalues)```The linear regression returns a low $r^2$ score. Although this may be partly attributed to the small training set and the availability of only state-level food insecurity data, the extremely low $r^2$ suggests that food accessibility is not strongly correlated with food insecurity. This means that households living far from food stores may still have good food security, and distance from food stores does not entirely dictate the level of food insecurity. Looking at the p-values, the food access population and income variables have the lowest p-values suggesting that they are the more significant variables in determining food insecurity.Furthermore, the regression model used to predict state food insecurity for 2015 returns a lower $r^2$, which suggests that changes in food accessibility do not fully account for changes in food insecurity. Other factors not present in the model and the dataset may be needed for determining food insecurity levels in a state. ### Food Assistance ProgramsThe availability of food assistance programs in the county can help with improving food accessibility in the region. We use a regression model to study the relationship between food assistance accessibility and the numerous variables related to food assistance programs provided in the dataset. Our goal is to determine if these programs are available in areas with low food accessibility where the need for such programs may be the greatest. Regression models are created separately for 2010 and 2015 to see how the availability of food assistance programs and food accessibility are related for both years. Following the previous analysis, the same state-level data frame is used for the regression models. The response variable is the state average Low Food Access Population Percentage.The input variables here are the state-level averages of: * Students eligible for free lunch (%)* Students eligible for reduced-price lunch (%)* SNAP participants (% pop)* SNAP benefits per capita (% change)* National School Lunch Program participants (%) * School Breakfast Program participants (% children)* Summer Food Service Program participants (% children) * FDPIR (Food Distribution Program on Indian Reservations) sites```{python}# create 2010 and 2015 dataframes for statesdf10 = pd.DataFrame(df_avg, columns = df10_cols)df15 = pd.DataFrame(df_avg, columns = df15_cols)scaler = StandardScaler()df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)df10.rename(columns =lambda x : str(x)[:-2], inplace=True)df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)# training and test datasets according to yearx10 = df10.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)y10 = df10['PCT_LACCESS_POP']x15 = df15.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)y15 = df15['PCT_LACCESS_POP']x10 = x10.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV','GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)x15 = x15.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV','GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)# 2010 modelX = sm.add_constant(x10)model = sm.OLS(y10, X).fit()model.summary()print("2010: ")print('R-squared:', model.rsquared)print('P-values:', model.pvalues)print()# 2015 modelX = sm.add_constant(x15)model = sm.OLS(y15, X).fit()model.summary()print("2015: ")print('R-squared:', model.rsquared)print('P-values:', model.pvalues)```For both years, the $r^2$ scores are similar and less than 0.5. This suggests that the prevalence of food assistance programs is only partly correlated with food accessibility. To check if the low $r^2$ is due to variation within states, we train and apply the regression model at the county level.```{python}# create 2010 and 2015 dataframes for statesdf10 = pd.DataFrame(df, columns = df10_cols)df15 = pd.DataFrame(df, columns = df15_cols)scaler = StandardScaler()df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)df10.rename(columns =lambda x : str(x)[:-2], inplace=True)df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)# training and test datasets according to yearx10 = df10.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)y10 = df10['PCT_LACCESS_POP']x15 = df15.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)y15 = df15['PCT_LACCESS_POP']x10 = x10.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV','GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)x15 = x15.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV','GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES','PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)# 2010 modelX = sm.add_constant(x10)model = sm.OLS(y10, X).fit()model.summary()print("2010: ")print('R-squared:', model.rsquared)print()# 2015 modelX = sm.add_constant(x15)model = sm.OLS(y15, X).fit()model.summary()print("2015: ")print('R-squared:', model.rsquared)```The county-level model performs worse than the state-level model, as shown by the lower $r^2$ scores. This suggests that food assistance programs may not be available in all counties where food access is difficult. However, the better performing state-level model suggests that people can travel within the state to a different county to access these food assistance programs. Therefore, food asssistance programs in the country may still be effective if traveling within state is not an issue.The previous analysis showed that food insecurity is not entirely limited by accessibility. If this is the case and if food assitance programs are available in nearby counties, low food accessibility populations may still be able to maintain better food security. ## Change in Food Accessibility (2010-15)Lastly, I use a regression model to analyze the change in food accessibility across U.S. counties. The response variable is the difference between the food accessibility population percentage for 2010 and 2015, calculated by subtracting the 2010 values from the 2015 values. This is the same variable that was used earlier to generate the geographic visualization of the change in accessibility.The predictor variables used here are listed below along with their descriptions. The variables encompass information such as the number of food stores, farm sales, expernditure on other food sources like restaurants, food insecurity levels, and access to food assistance programs.```{python}import pandas as pdtable = {'PCT_REDUCED_LUNCH': "Students eligible for reduced-price lunch (%)",'GROCPTH': "Grocery stores/1,000 pop",'SUPERCPTH':"Supercenters & club stores/1,000 pop",'CONVSPTH':"Convenience stores/1,000 pop",'SPECSPTH':"Specialized food stores/1,000 pop",'SNAPSPTH':"SNAP-authorized stores/1,000 pop",'FSRPTH':"Fast-food restaurants/1,000 pop",'PC_FFRSALES':"Expenditures per capita, fast food",'PC_FSRSALES':"Expenditures per capita, restaurants",'PCT_SNAP':"SNAP participants (% pop)",'SNAP_PART_RATE':"SNAP participants (% eligible pop)",'PCT_NSLP':"National School Lunch Program participants (% children)",'PCT_SBP':"School Breakfast Program participants (% children)",'PCT_SFSP':"Summer Food Service Program participants (% children)",'FDPIR':"FDPIR Sites",'FOODINSEC':"State Household food insecurity (%, three-year average",'VLFOODSEC':"State Household very low food insecurity (%, three-year average","DIRSALES_FARMS":"Farms wth direct sales",'FMRKTPTH':"Farmers' markets/1,000 pop"}var = pd.DataFrame.from_dict(table, orient ='index')var```All variables are standardized before training and applying the regression model.:::{.callout-note}The income and household vehicle availability variables are removed here since they are largely correlated with the food accessibility response variables.:::```{python}# create 2010 and 2015 dataframes for all countiesdf10 = pd.DataFrame(df, columns = df10_cols)df15 = pd.DataFrame(df, columns = df15_cols)# Rename columnsdf10.rename(columns =lambda x : str(x)[:-2], inplace=True)df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)df15.rename(columns =lambda x : str(x)[:-2], inplace =True)df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)# Drop columnsdf10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK','PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE','PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)# Create dataframe with differences between 2010 & 2015 values as columnsdiff = pd.DataFrame()for c in df10.columns: diff[c] = df15[c]-df10[c]# Scale all columnsscaler = StandardScaler()diff = pd.DataFrame(scaler.fit_transform(diff), columns=diff.columns)# Drop correlating variables and create train/test dataX = diff.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)y = diff['PCT_LACCESS_POP']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Apply linear regressionX = sm.add_constant(X_train)model = sm.OLS(y_train, X).fit()model.summary()print('R-squared:', model.rsquared)print('P-values:', model.pvalues)```The model used here does not produce good $r^2$ scores. One reason for this may be that the counties for which there are significant changes are low in number and may seem like unusual values for the dataset. This is shown by the distribution plot shown in @fig-extremediff. ```{python}#| label: fig-extremediff#| fig-cap: "Distribution of Change in Low Access Population (Standardized)"# Plot distribution of difference in low food access pop% def plt_dis(c): f = sns.displot(data=diff, x=c, height=4, aspect=10/8.27, bins=20) plt.ylabel("Number of Counties", alpha=0.8) plt.xlabel("Change in Low Food Access Pop. % (standardized)", alpha=0.8) plt.show()plt_dis('PCT_LACCESS_POP')```Now, we create a model using data for only the counties that have significant changes in food accessibility. For this, the dataset was filtered to contain only counties for which the standardized Change in Food Accessibility was greater than 1.5 and less than -1.5. We apply both linear regression and decision tree regression models and compare their performance.```{python}# filter data for extreme casesdf_extreme = diff[diff['PCT_LACCESS_POP'] >1.5]df_extreme = diff[diff['PCT_LACCESS_POP'] <-1.5]# Drop correlated variables and create train/test dataX = df_extreme.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)y = df_extreme['PCT_LACCESS_POP']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Linear regressionprint("Linear Regression")X = sm.add_constant(X_train)model = sm.OLS(y_train, X).fit()model.summary()print('R-squared:', model.rsquared)print('P-values:', model.pvalues)print()print()# Decision Treeprint("Decision Tree Regression")dt_model = DecisionTreeRegressor(random_state=42)dt_model.fit(X_train, y_train)dt_pred = dt_model.predict(X_test)dt_score = dt_model.score(X_test, y_test)print("R-aquared score:", dt_score)```The $r^2$ score for the linear regression model improved significantly when considering only counties for which food accessibility changed notably. When using a decision tree regressor, the $r^2$ score improved even further. It is possible that these counties with notable changes in food accessibility are also the same counties with extremely low food access, as seen in @fig-change and @fig-map10. This brings us back to the two distinct groups of counties shown in @fig-accessdist10; the counties with significant changes in food access may belong to the group of counties where food access tends to be very low. Analyzing the data based on these groups may lead to better results.The smallest p-value here is for the FDPIR variable which measures the Food Distribution Program on Indian Reservations. This variable may have been significant in the model because the extreme counties filtered here fall in regions where Native American populations are more significant. # ConlusionThis project analyzed food accessibility and other food environment factors. The analysis showed that counties with low food accessibility can be divided into two groups: those with low and very low food accessibility. The counties with very low food accessibility are located in the West and Midwest regions of the country, with populations primarily consisting of White, Hispanic, and Native American/ Alaska Native people. Additionally, the analysis also revealed a correlation between income and food accessibility, as a significant proportion of the population in counties with low food accessibility had low incomes.Initially, all counties were found to have low food accessibility. However, examining household vehicle availability suggested that the problem may be less severe, as the majority of counties with poor food access had access to cars. There were only a few counties in which low food accessibility coincided with a lack of available vehicles.Food accessibility remained largely remained the same from 2010 to 2015, with significant changes only occurring in counties with very low food accessibility. The regression model used here tried to look explain the changes by looking at how number of food stores/sources and assistance programs changed. However, these factors were not sufficient build an accurate model which is likely due to the absence of significant variables in the dataset. Finally, regression models were used to assess the relationship between food insecurity and food accessibility, as well as other food environment factors. The analysis revealed that food insecurity cannot be fully explained by food accessibility alone. Additional factors such as income, poverty rates, and unemployment levels may also be significant. However, these variables could not be included in the current models due to their absence from the dataset or overlap with the response variables. Future analysis could incorporate such variables from other datasets to gain a more comprehensive understanding of food insecurity.# References